Coefficient of determination
Definition
$R^2$ is one minus the residual sum of squares (the unexplained variance) divided by the total sum of squares. See the figure in https://en.wikipedia.org/wiki/Coefficient_of_determination#Definitions
Let $\hat{y}_i$ be the value of $y_i$ predicted by the given model. As the model becomes better, the residual sum of squares

$$SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2$$

should decrease. But how small is small?

One baseline "model" that we can assume is to always predict $\hat{y}_i = \bar{y}$, where $\bar{y} = \frac{1}{n}\sum_i y_i$ (the mean). The sum of squares for this "null" model, $SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2$, gives us a baseline. The reduction of the residual sum of squares (RSS) achieved by the regression can be written as $SS_{\text{tot}} - SS_{\text{res}}$. We can then normalize this by the total variation in the data. $R^2$ is now simply:

$$R^2 = \frac{SS_{\text{tot}} - SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
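As a quick numerical sketch of the definition, we can compute $R^2$ by hand and check it against `sklearn.metrics.r2_score` (the toy `y` and `y_hat` arrays below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy observations and model predictions (illustrative values only)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares (null model)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                         # 0.995
print(np.isclose(r2_manual, r2_score(y, y_hat)))  # True
```

Here the null model (always predicting the mean, 6.0) leaves a total sum of squares of 20, of which the model fails to explain only 0.10.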
Adjusted $R^2$
Adding more explanatory variables spuriously increases the $R^2$ value, so the adjusted $R^2$ is often used when there are multiple explanatory variables.
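The usual adjustment penalizes the number of explanatory variables $p$: $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$. A minimal sketch (the `r2`, `n`, and `p` values below are made-up illustrations):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n samples and p explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The same plain R^2 = 0.90 looks much less impressive
# once we account for how many variables were used:
print(adjusted_r2(0.90, n=100, p=2))   # ≈ 0.8979
print(adjusted_r2(0.90, n=100, p=50))  # ≈ 0.7980
```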
Issues
- Why I'm Not a Fan of R-Squared by John Myles White
- Cosma Shalizi's "rant" about $R^2$: http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf
The issue with models without an intercept
Supposedly, R uses the baseline model $\hat{y}_i = 0$ instead of $\hat{y}_i = \bar{y}$ when the model is fit without an intercept, so the total sum of squares becomes $\sum_i y_i^2$.
```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate synthetic data
np.random.seed(0)
n = 100
X = np.linspace(0, 10, n).reshape(-1, 1)
y = 3 + 2 * X.flatten() + np.random.normal(0, 1, n)

# Fit linear model with intercept
model_with_intercept = LinearRegression(fit_intercept=True)
model_with_intercept.fit(X, y)
r2_with_intercept = model_with_intercept.score(X, y)

# Fit linear model without intercept
model_without_intercept = LinearRegression(fit_intercept=False)
model_without_intercept.fit(X, y)
y_pred_without_intercept = model_without_intercept.predict(X)
# Note: sklearn's score() always uses the mean baseline,
# even when fit_intercept=False
r2_without_intercept = model_without_intercept.score(X, y)

# SST relative to the mean (with intercept) vs. relative to zero (without)
SST_with_intercept = np.sum((y - np.mean(y))**2)
SST_without_intercept = np.sum(y**2)

# R^2 computed manually with the zero baseline
SSE_without_intercept = np.sum((y - y_pred_without_intercept)**2)
r2_without_intercept_manual = 1 - SSE_without_intercept / SST_without_intercept

print(r2_with_intercept, r2_without_intercept, r2_without_intercept_manual)
print(SST_with_intercept, SST_without_intercept)
```
The same comparison in R:

```r
# Generate synthetic data
set.seed(0)
n <- 100
X <- seq(0, 10, length.out = n)
y <- 3 + 2 * X + rnorm(n, 0, 1)

# Fit linear model with intercept
model_with_intercept <- lm(y ~ X)
summary(model_with_intercept)

# Compute SST relative to the mean
SST_with_intercept <- sum((y - mean(y))^2)

# Fit linear model without intercept
model_without_intercept <- lm(y ~ X - 1)
summary(model_without_intercept)  # reports R^2 against the zero baseline

# Compute SST relative to zero
SST_without_intercept <- sum(y^2)

# Compute R^2 manually with the zero baseline;
# this matches the R^2 reported by summary() above
SSE_without_intercept <- sum(residuals(model_without_intercept)^2)
R2_without_intercept_manual <- 1 - SSE_without_intercept / SST_without_intercept

cat("SST with intercept:", SST_with_intercept, "\n")
cat("SST without intercept:", SST_without_intercept, "\n")
cat("R2 without intercept (manual):", R2_without_intercept_manual, "\n")
```